We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. We have two measures of growth rate: basal area and biomass. A gradient boosted machine/model (GBM) seems like a good candidate for this work. A GBM is a machine learning model that uses decision trees to fit the data.
A decision tree starts with all of the observations and then, from the variables provided, tries to find the variable split that results in the “purest” groupings of the data. In this case, it would try to place rows with higher growth rates in one node and those with lower growth rates in another.
GBMs are an ensemble of decision trees, but the trees are fit sequentially. We call GBMs an ensemble of weak learners because each subsequent tree attempts to correct the errors of the previous trees. Thus, while one tree by itself cannot describe the relationships, all of the trees together can. Below is a figure by Bradley Boehmke that illustrates how each subsequent tree improves the fit on the data.
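To make the sequential idea concrete, here is a toy sketch (not part of the analysis below) in which each shallow `rpart` tree is fit to the residuals left by the ensemble built so far:

```r
# Toy illustration of boosting: each small tree fits the residuals
# of the ensemble built so far
library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.2)

pred <- rep(0, nrow(toy))     # ensemble prediction starts at zero
learn_rate <- 0.1
for (i in 1:50) {
  toy$resid <- toy$y - pred   # errors of the current ensemble
  tree <- rpart(resid ~ x, data = toy,
                control = rpart.control(maxdepth = 2, cp = 0))
  pred <- pred + learn_rate * predict(tree, toy)
}
# pred now tracks sin(x) far better than any single shallow tree could
```

Each tree on its own is a weak learner; the shrunken sum of all of them is the strong one.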
Below is a list of the changes made to the raw data:

- Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope (the environmental variables) and the plant traits were each fit with a linear model (Variable ~ Sample date), and a variable was replaced with its residuals when the p-value was < 0.05. This resulted in changes for the following variables: Estem, Branching Distance, Stem Wood Density, LMA, LNC, d15N, Ks, X.Sapwood, d13C, Soil.Fertility, Light, Temperature, pH.

First, we can import the data and select the labels needed for plotting later.
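A sketch of the significance-based detrending described above (the helper name is hypothetical; it assumes the sampling-date column `julian.date.2011`):

```r
# replace a variable with its residuals from Variable ~ Sample date
# whenever the date effect is significant (p < 0.05)
detrend_if_significant <- function(df, var, date_col = "julian.date.2011") {
  fit <- lm(reformulate(date_col, var), data = df, na.action = na.exclude)
  p_val <- summary(fit)$coefficients[2, 4]
  if (!is.na(p_val) && p_val < 0.05) {
    res <- residuals(fit)   # NA-padded back to nrow(df) by na.exclude
    df[[var]] <- ifelse(is.na(res), df[[var]], res)
  }
  df
}
```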
# prior error suggests that there are single quotation marks used to parse numbers
# I specify that we should read those as a thousand separator
rgr_msh_residuals_julian_df <- read_csv(here("data/rgr_msh_residuals_julian_df.csv"),
                                        guess_max = 10000,
                                        col_types = cols())

# these are the labels for the variables - needed for plotting
labels_rgr_msh <- read_csv(here("data/labels.csv"),
                           guess_max = 10000,
                           col_types = cols())

# this lists the variables we actually need to keep, depending on the analyses
PlantTraitsKeep <- labels_rgr_msh %>%
  filter(Class == 1) %>%
  select(Feature) %>%
  reduce(c)

EnvironmentalVariablesKeep <- c("Soil.Fertility", "Light", "Temperature",
                                "pH", "Soil.Humidity.Depth", "Slope")

OthersKeep <- labels_rgr_msh %>%
  filter(Class %in% c(3, 4, 5, 7)) %>%
  select(Feature) %>%
  reduce(c)
OthersKeep <- c("SampleID", "Site", "Species", "julian.date.2011",
                OthersKeep[!OthersKeep %in% c("SampleID", "Site", "Species", "julian.date.2011")])

Now, we can create the training and test set; a 70:30 split is the usual default.
# we need a training and test set to assess the performance of the model
# a 70:30 split is typical
# first, I will work with the imputed data
set.seed(634)
split_train_test <-
  initial_split(
    data = rgr_msh_residuals_julian_df,
    prop = 0.70)

rgr_msh_na <-
  split_train_test %>% training() %>% mutate(Group = "Train") %>%
  bind_rows(split_train_test %>% testing() %>% mutate(Group = "Test"))

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_92"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 6
##
## $min_rows
## [1] 1
##
## $nbins
## [1] 128
##
## $nbins_cats
## [1] 2048
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3433.695
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.54
##
## $col_sample_rate
## [1] 0.39
##
## $col_sample_rate_per_tree
## [1] 0.76
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "UniformAdaptive"
##
## $categorical_encoding
## [1] "Enum"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age"
##
## $y
## [1] "BIO_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bio_residuals <-
  gbm(BIO_GR ~ .,
      data =
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BIO_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BIO_GR"))),
      n.trees = 1000,
      interaction.depth = 6,  # max_depth
      shrinkage = 0.05,       # learn_rate
      n.minobsinnode = 17,    # minimum observations in a terminal node
      bag.fraction = 0.54,    # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
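With the fitted `gbm` object, the importance and partial-dependence steps can be sketched as follows (using the cross-validated optimum number of trees; `Light` is just an example predictor):

```r
# optimal number of trees from the 5-fold cross-validation
best_iter <- gbm.perf(gbm_regressor_bio_residuals, method = "cv")

# relative influence of each predictor
summary(gbm_regressor_bio_residuals, n.trees = best_iter, plotit = FALSE)

# partial dependence of growth rate on a single predictor,
# with the remaining predictors marginalised out
plot(gbm_regressor_bio_residuals, i.var = "Light", n.trees = best_iter)
```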
How does the model perform when we use the true individual trait value?
Does the test error change when we use the average for that species?
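One way to sketch that comparison, assuming the objects created above: predict on the test set once with the observed trait values and once with each trait replaced by its species mean, then compare RMSE.

```r
test_df <- rgr_msh_na %>% filter(Group == "Test", !is.na(BIO_GR))
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# predictions with the true individual trait values
pred_ind <- predict(gbm_regressor_bio_residuals, test_df, n.trees = 1000)

# replace each trait with its species mean, then predict again
test_sp <- test_df %>%
  group_by(Species) %>%
  mutate(across(any_of(PlantTraitsKeep), ~ mean(.x, na.rm = TRUE))) %>%
  ungroup()
pred_sp <- predict(gbm_regressor_bio_residuals, test_sp, n.trees = 1000)

c(individual = rmse(test_df$BIO_GR, pred_ind),
  species_mean = rmse(test_df$BIO_GR, pred_sp))
```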
Let’s explore the interactions in these data.
Now, we plot interactions with values > 0.1.
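Pairwise interaction strength in a `gbm` model can be estimated with Friedman's H-statistic via `interact.gbm` (0 means no interaction; the pair below is just an example):

```r
# H-statistic for one example pair of predictors; in practice we loop
# over all pairs and keep those above the threshold
interact.gbm(gbm_regressor_bio_residuals,
             data = rgr_msh_na %>% filter(Group == "Train", !is.na(BIO_GR)),
             i.var = c("Light", "Temperature"),
             n.trees = 1000)
```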
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_3"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 14
##
## $min_rows
## [1] 2
##
## $nbins
## [1] 512
##
## $nbins_cats
## [1] 128
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3593.317
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.36
##
## $col_sample_rate
## [1] 0.96
##
## $col_sample_rate_per_tree
## [1] 0.2
##
## $min_split_improvement
## [1] 1e-04
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR"))),
      n.trees = 1000,
      interaction.depth = 14, # max_depth
      shrinkage = 0.05,       # learn_rate
      n.minobsinnode = 23,    # minimum observations in a terminal node
      bag.fraction = 0.36,    # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform when we use the true individual trait value?
Does the test error change when we use the average for that species?
Let’s explore the interactions in these data.
Now, we plot interactions with values > 1.01763794414045e-14.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_24"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 11
##
## $min_rows
## [1] 1
##
## $nbins
## [1] 512
##
## $nbins_cats
## [1] 2048
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3587.823
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.65
##
## $col_sample_rate
## [1] 0.63
##
## $col_sample_rate_per_tree
## [1] 0.83
##
## $min_split_improvement
## [1] 1e-04
##
## $histogram_type
## [1] "QuantilesGlobal"
##
## $categorical_encoding
## [1] "Enum"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope" "Species"
## [7] "Tree.Age"
##
## $y
## [1] "BIO_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bio_residuals_species <-
  gbm(BIO_GR ~ .,
      data =
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BIO_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BIO_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 11, # max_depth
      shrinkage = 0.05,       # learn_rate
      n.minobsinnode = 5,     # minimum observations in a terminal node
      bag.fraction = 0.65,    # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform when we use the true individual trait value?
Let’s explore the interactions in these data.
Now, we plot interactions with values > 0.15.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_86"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 9
##
## $min_rows
## [1] 1
##
## $nbins
## [1] 32
##
## $nbins_cats
## [1] 64
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3530.433
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.7
##
## $col_sample_rate
## [1] 0.81
##
## $col_sample_rate_per_tree
## [1] 0.2
##
## $min_split_improvement
## [1] 1e-04
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope" "Species"
## [7] "Tree.Age"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals_species <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 9,  # max_depth
      shrinkage = 0.05,       # learn_rate
      n.minobsinnode = 4,     # minimum observations in a terminal node
      bag.fraction = 0.7,     # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
Now, we plot interactions with values > 0.15.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.